1. Create the DataFrame from a list
Each element of the list is a Row object; the parallelize() function converts the list to an RDD, and the toDF() function converts the RDD to a DataFrame.
from pyspark.sql import Row
l = [Row(name='Jack', age=10), Row(name='Lucy', age=12)]
df = sc.parallelize(l).toDF()
2. Create the DataFrame from an RDD. The data in the RDD carries no schema, so each record is wrapped in a Row object first.
," Logistic regression models is neat ")). TODF (" label "," sentence ") Val tokenizer = New Tokenizer (). Setinputcol ("sentence"). Setoutputcol ("words") val Wordsdata = Tokenizer.transform (Sentencedata) Val HASHINGTF = new HASHINGTF (). Setinputcol ("words"). Setoutputcol ("Rawfeatures"). Setnumfeatures (+) Val featurizeddata = Hashingtf.transform (wordsdata)//Alternatively, Countvectorizer can also be used to get term frequency vectors val IDF =
  (0.0, "I wish Java could use case classes"),
  (1.0, "Logistic regression models are neat")
)).toDF("label", "sentence")
val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
val wordsData = tokenizer.transform(sentenceData)
val hashingTF = new HashingTF()
  .setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(20)
val featurizedData = hashingTF.transform(wordsData)
// Alternatively, CountVectorizer can also be used to get term frequency vectors
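The snippet breaks off at the IDF step. A minimal sketch of how such a TF-IDF example usually continues, assuming the featurizedData DataFrame built above:

import org.apache.spark.ml.feature.IDF

// Fit an IDF model on the raw term-frequency vectors and rescale them
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val idfModel = idf.fit(featurizedData)
val rescaledData = idfModel.transform(featurizedData)
rescaledData.select("label", "features").show()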
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.sql.Row

// Prepare training data from a list of (label, features) tuples.
val training = spark.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0, 1.1, 0.1)),
  (0.0, Vectors.dense(2.0, 1.0, -1.0)),
  (0.0, Vectors.dense(2.0, 1.3, 1.0)),
  (1.0, Vectors.dense(0.0, 1.2, -0.5))
)).toDF("label", "features")

// Create a LogisticRegression instance. This instance is an Estimator.
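The example stops at that comment. A minimal sketch of how such an Estimator is typically configured and fit on the training DataFrame above (the parameter values are illustrative, not taken from the original text):

import org.apache.spark.ml.classification.LogisticRegression

// Configure the Estimator, then learn a LogisticRegressionModel (a Transformer)
val lr = new LogisticRegression()
  .setMaxIter(10)      // illustrative value
  .setRegParam(0.01)   // illustrative value
val model = lr.fit(training)
println(s"Model was fit using parameters: ${model.parent.extractParamMap}")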
case class Brower(v1: String, v2: String, v3: String, v4: String, v5: String, v6: String)

def main(args: Array[String]): Unit = {
  val conf = new SparkConf()
    .setAppName("ReadJson")
    .setMaster("local")
    .set("spark.executor.memory", "50g")
    .set("spark.driver.maxResultSize", "50g")
  val sc = new SparkContext(conf)
  val sqlContext = new SQLContext(sc)
  // implicit conversion
  import sqlContext.implicits._
  val userInfo = sc.textFile("c:\\users\\bigdata\\desktop\\file\\BigData\\Spark\\3.sparkcore_2\\dat
To use the toDF() method, an implicit conversion is required; the first map splits each line into an array, and the second map turns it into a case-class object:
import sqlContext.implicits._
val df: DataFrame = sc.textFile("c:\\users\\wangyongxiang\\desktop\\plan\\person.txt")
  .map(_.split(" "))
  .map(p => Person(p(0), p(1).trim.toInt))
  .toDF()
// Another form of the second method, using SQLContext's or SparkSession's createDataFrame(), is in fact identical to
type is not available, so the custom bean does not work.
// The official documentation also has an example of creating a Dataset through a bean, but I did not succeed in running it,
// so for now I use the DataFrame-creation method to create a Dataset[Row]:
// sqlContext.createDataset(idAgeRDDRow)
// String, Integer, Long, etc. types can currently create a Dataset directly
Seq(1, 2, 3).toDS().show()
sqlContext.createDataset(sc.parallelize(Array(1, 2, 3))).show()
}
}
But it's actually a Dataset, because a DataFrame is just a Dataset[Row].
First we're going to create a SparkSession:
val spark = SparkSession.builder()
  .appName("test")
  .master("local")
  .getOrCreate()
// Import this to convert RDDs into DataFrames and to support SQL operations
import spark.implicits._
Then we create DataFrames through the SparkSession.
1. toDF: create a DataFrame with the toDF function by importing spark.implicits._, as sketched below.
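A minimal sketch of that approach, reusing the spark session and the implicits import from above (the Person case class and the sample data are illustrative, not from the original text):

// Any case class (a Product type) supplies the schema for toDF
case class Person(name: String, age: Int)

// Works on a local Seq as well as on an RDD of case-class instances
val df = Seq(Person("Jack", 10), Person("Lucy", 12)).toDF()
val df2 = spark.sparkContext.parallelize(Seq(Person("Tom", 15))).toDF()
df.printSchema()
df.show()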
+ = ("path"-> path)
Save () c8/>}
2. Trace the Save method.
/**
 * Saves the content of the [[DataFrame]] as the specified table.
 *
 * @since 1.4.0
 */
def save(): Unit = {
  ResolvedDataSource(
    df.sqlContext,
    source,
    partitioningColumns.map(_.toArray).getOrElse(Array.empty[String]),
    mode,
    extraOptions.toMap,
    df)
}
3. Here source is SQLConf's defaultDataSourceName:
private var source: String = df.sqlContext.conf.defaultDataSourceName
where DEFAULT_DATA_SOURCE_NAME corresponds to the spark.sql.sources.default setting, whose default value is the Parquet data source.
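In user code this means a plain write.save() goes to that default (Parquet) source unless a format is named explicitly. A small illustrative sketch (the paths are hypothetical):

// Uses the default data source (Parquet) because no format is specified
df.write.save("/tmp/out_parquet")

// Override the default by naming a source explicitly
df.write.format("json").save("/tmp/out_json")

// Or change the default for the session (Spark 2.x API)
spark.conf.set("spark.sql.sources.default", "json")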
val recommondList = sc.parallelize(movies_map.keys.filter(!myRatedMovieIds.contains(_)).toSeq)
// Score the candidate movies with the model and output the 10 highest-rated records, sorted from high to low
bestModel.predict(recommondList.map((0, _))).collect().sortBy(-_.rating).take(10).foreach { r =>
  println("%2d".format(i) + "---------->: \nmovie name --" + movies_map(r.product) + "\nmovie type ---" + moviesType_map(r.product))
  i += 1
}
// Calculate the people who may be interested
println("Interes
operation mode. DataFrame provides a number of methods to manipulate the data, such as where and select.
2. DSL mode. The DSL still uses the methods provided by DataFrame, but it makes it easy to refer to a column as ' + column name (for example 'name).
3. Register the data as a table and manipulate it with SQL statements.
All three modes are sketched right after the following snippet.

object TextFile {
  def main(args: Array[String]) {
    // First step: build the SparkContext object, calling the constructor with new; otherwise it becomes a call to the apply method of the companion object of the same name
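Picking up the three access modes listed above, a minimal sketch (the people DataFrame and its columns are illustrative, not from the original text):

import spark.implicits._

// Illustrative DataFrame with name/age columns
val people = Seq(("Jack", 10), ("Lucy", 12)).toDF("name", "age")

// 1. Method mode: use the methods DataFrame provides, such as where and select
people.where("age > 10").select("name").show()

// 2. DSL mode: refer to columns with the ' prefix (or $"col")
people.filter('age > 10).select('name).show()

// 3. SQL mode: register the data as a table and query it with SQL
people.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 10").show()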
This article mainly implements the random forest algorithm in the PySpark environment:
%pyspark
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import StringIndexer
from pyspark.ml.classification import RandomForestClassifier
from pyspark.sql import Row
# Goal: solve a binary classification problem with a random forest and evaluate the classification results
# 1. Read the data
data = spark.sql("""SELECT * FROM DataTable""")
# 2. Construct the training set
dataSet = data.na.fill('0').rdd.map(list)
," Spark compile ", 1.0), (11L," Hadoop Software ", 0.0)). TODF (" id "," text "," label ")//Configure an ML pipeline, which consists of three stages:tokenizer, ha
SHINGTF, and LR. Val tokenizer = new Tokenizer (). Setinputcol ("text"). SetouTputcol ("words") val HASHINGTF = new HASHINGTF (). Setinputcol (Tokenizer.getoutputcol). Setoutputcol ("Features") Val LR
= new Logisticregression () Setmaxiter val pipeline = new Pipeline (). Setstages (Array (
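The snippet stops once the pipeline is assembled. A minimal sketch of how such a pipeline is usually fit and then applied, assuming the labelled DataFrame built above is named training (the test rows are illustrative):

// Fit the whole pipeline on the labelled training data
val model = pipeline.fit(training)

// Reuse the fitted pipeline on new, unlabelled rows
val test = spark.createDataFrame(Seq(
  (12L, "Spark streaming"),
  (13L, "Hadoop cluster")
)).toDF("id", "text")

model.transform(test)
  .select("id", "text", "probability", "prediction")
  .show(false)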
(Df1 ("Masterhotel"), Df1 ("Order_cii_notcancelcii"), Df1 ("Rank1"), Df1 ("OrderDate"))
Val actual_frame=data2.todf ()
Building Dataframe Type Result sets
Case Class ResultSet (Masterhotel:int,//Parent Hotel ID
Quantity:double,//Real output
Rank:int,//Sort
Date:string,//Date
Frcst_cii:double//Forecast output
)
Val Ac_1=actual_frame.collect ()
Val pr_1=predtrain.collect () (0)
Val output0= (0 until Ac_1.length). Map (I =>resultset (ac_1 (i) (0
", "Favorite_Color"). ShowUsersdf.select ("name", "Favorite_Color"). Write.save ("/root/temp/result")2. Parquet file: A data source loaded by default for the Sparksql load function, files stored by columnHow do I convert other file formats to parquet files?Example: JSON file---->parquet fileVal Empjson = Spark.read.json ("/root/temp/emp.json") #直接读取一个具有格式的数据文件作为DataFrameEmpJSON.write.parquet ("/root/temp/empparquet") #/empparquet directory cannot exist beforehandor EmpJSON.wirte.mode ("overwrite
the tree structure to print.
9. registerTempTable(tableName: String) returns Unit; it registers the DF object as a temporary table, and the table is deleted when the object that created it is deleted.
10. schema returns a StructType, giving the field names and types as a struct.
11. toDF() returns a new DataFrame type.
12. toDF(colNames: String*) returns the columns named in the parameters as a new DataFrame type.
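A small illustrative sketch of items 9-12, assuming an existing DataFrame df with name and age columns (none of this is from the original text):

// 9. Register the DataFrame as a temporary table and query it with SQL
df.registerTempTable("people")   // createOrReplaceTempView in newer Spark versions
spark.sql("SELECT name FROM people").show()

// 10. Inspect the schema as a StructType (field names and types)
println(df.schema)

// 11./12. toDF returns a new DataFrame, optionally renaming all columns
val renamed = df.toDF("person_name", "person_age")
renamed.printSchema()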
This article mainly implements the GBDT algorithm in the PySpark environment; the implementation code is as follows:
%pyspark
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.feature import StringIndexer
from numpy import allclose
from pyspark.sql.types import *
# 1. Read the data
data = spark.sql("""SELECT * FROM XXX""")
# 2. Construct the training set
dataSet = data.rdd.map(list)
(trainData, testData) = dataSet.randomSplit([0.75, 0.25])
train